Policy gradient in Lipschitz Markov Decision Processes
Authors
Abstract
Similar articles
Policy-Gradient Algorithms for Partially Observable Markov Decision Processes
Partially observable Markov decision processes are interesting because of their ability to model most conceivable real-world learning problems, for example, robot navigation, driving a car, speech recognition, stock trading, and playing games. The downside of this generality is that exact algorithms are computationally intractable. Such computational complexity motivates approximate approaches....
Online Markov decision processes with policy iteration
The online Markov decision process (MDP) is a generalization of the classical Markov decision process that incorporates changing reward functions. In this paper, we propose practical online MDP algorithms with policy iteration and theoretically establish a sublinear regret bound. A notable advantage of the proposed algorithm is that it can be easily combined with function approximation, and thu...
An Online Policy Gradient Algorithm for Markov Decision Processes with Continuous States and Actions
We consider the learning problem under an online Markov decision process (MDP) aimed at learning the time-dependent decision-making policy of an agent that minimizes the regret, i.e., the difference from the best fixed policy. The difficulty of online MDP learning is that the reward function changes over time. In this letter, we show that a simple online policy gradient algorithm achieves regret O(√T)...
Internal-State Policy-Gradient Algorithms for Partially Observable Markov Decision Processes
Policy-gradient algorithms are attractive as a scalable approach to learning approximate policies for controlling partially observable Markov decision processes (POMDPs). POMDPs can be used to model a wide variety of learning problems, from robot navigation to speech recognition to stock trading. The downside of this generality is that exact algorithms are computationally intractable, motivatin...
A policy gradient method for semi-Markov decision processes with application to call admission control
Solving a semi-Markov decision process (SMDP) using value or policy iteration requires precise knowledge of the probabilistic model and suffers from the curse of dimensionality. To overcome these limitations, we present a reinforcement learning approach where one optimizes the SMDP performance criterion with respect to a family of parameterised policies. We propose an online algorithm that simu...
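The abstracts above all build on the same core idea: instead of value or policy iteration over a known model, one parameterises the policy and ascends the gradient of the expected return estimated from sampled trajectories. A minimal REINFORCE-style sketch of that idea on a toy two-state MDP is shown below; the transition matrix, rewards, and hyperparameters are illustrative assumptions, not taken from any of the papers listed.

```python
import numpy as np

# Toy 2-state, 2-action MDP (illustrative values, not from the papers above).
rng = np.random.default_rng(0)
n_states, n_actions = 2, 2

# P[s, a] = next-state distribution; R[s, a] = immediate reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.8, 0.2], [0.1, 0.9]]])
R = np.array([[0.0, 0.1],
              [0.1, 1.0]])  # action 1 in state 1 is best in the long run

theta = np.zeros((n_states, n_actions))  # softmax policy parameters

def policy(s):
    """Softmax action distribution in state s."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def run_episode(T=50):
    """Sample one trajectory of (state, action, reward) triples."""
    s, traj = 0, []
    for _ in range(T):
        a = rng.choice(n_actions, p=policy(s))
        traj.append((s, a, R[s, a]))
        s = rng.choice(n_states, p=P[s, a])
    return traj

alpha, gamma = 0.1, 0.95
for _ in range(2000):
    G = 0.0
    # Walk the trajectory backwards accumulating the discounted return G,
    # then ascend the score function: grad log pi(a|s) = onehot(a) - pi(.|s).
    for s, a, r in reversed(run_episode()):
        G = r + gamma * G
        grad = -policy(s)
        grad[a] += 1.0
        theta[s] += alpha * G * grad

print(np.round(policy(0), 2), np.round(policy(1), 2))
```

After training, the softmax policy should concentrate on action 1, which drives the chain toward the high-reward state. The papers above extend this basic scheme in different directions: to internal-state policies for POMDPs, to online MDPs with changing rewards, and to semi-Markov decision processes.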
Journal
Journal title: Machine Learning
Year: 2015
ISSN: 0885-6125,1573-0565
DOI: 10.1007/s10994-015-5484-1